We will analysis the Single cell RNA-seq data(gene count) from Human.There are 1099 sample in the origin file.Before analysis,I had combined them into a dataframe and save it to txt file(R or shell).Based on the infornmation from the sample xlxs file(sample message file),we know that there are four control condition in some sample.And we will show them bellow.

Load the packages

library(Seurat)
library(data.table)
library(NMF)
library(rsvd)
library(Rtsne)
library(ggplot2)
library(cowplot)
library(sva)
library(igraph)
library(cccd)
library(KernSmooth)
library(beeswarm)
library(stringr)
library(formatR)
source("tools.R")
library(DESeq2)

The function will be used in the follow

We have two steps to analysis:

Condition message Step 1: Ignore control condition,use all sample Step 2:Under the control condition Positive Negative Tube_1 Tube_2

Step 1: All data: Analysis based on sample group

Read data

Data QA

human.only.pro <- Load_data(data_dir = "./human.txt")
important.genes <- c("ITGB4", "ABCB5", "KRT19", "ACTB", "KRT12", "KRT5", "GAPDH", 
    "KRT3", "PAX6", "WNT7A", "KRT14", "TP63", "KRT10")

table(unlist(lapply(colnames(human.only.pro), function(x) return(str_split(x, 
    "_")[[1]][2]))))
## 
## 10um 20um  2um  6um 
##  326  575   18  180
table(unlist(lapply(colnames(human.only.pro), function(x) return(str_split(x, 
    "_")[[1]][1]))))
## 
##    hc001    hc006    hc009    hc012    hc017    hc018    hc020    hc021 
##       15       92      187       66      188      188      184      158 
## shoutiao 
##       21

Compared to other cell size group, remove 2um size samples

Create Seurat object and not caculate DESeq,but not set min.cells and min.genes

# only select the cells contain 10 genes expressed at least,select the genes
# must be expressed in two cells at least
human.all.DESeq <- DESeq_SeuratObj(X = human.only.pro, DESq = FALSE, min.cells = 10, 
    min.genes = 2)
## [1] "Scaling data matrix"
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

all.sample.group <- unlist(lapply(human.all.DESeq@cell.names, function(x) return(str_split(x, 
    "_")[[1]][1])))
all.sample.size <- unlist(lapply(human.all.DESeq@cell.names, function(x) return(str_split(x, 
    "_")[[1]][2])))
# reset ident human.all.DESeq<-SetIdent(human.all.DESeq,cells.use =
# human.all.DESeq@cell.names,ident.use = all.sample.size)

Figure Explore

First,use the plot,eg. Barplot,Violin…,we can explore some message from sample

Group_Bar(human.all.DESeq@raw.data, group = all.sample.group)

Group_Bar(human.all.DESeq@raw.data, group = all.sample.size)

# We are interested in the gene ITGB4
GenePlot(human.all.DESeq, gene1 = "ITGB4", gene2 = important.genes[2])
# VlnPlot(human.all.DESeq,features.plot = 'ITGB4',y.lab.rot = 90) # Violinn
# plot of gene ITGB in all sample
VlnPlot(human.all.DESeq, features.plot = important.genes[important.genes %in% 
    rownames(human.all.DESeq@raw.data)], y.lab.rot = 90)  # Violinn plot of gene ITGB in all sample

According to the plot of explore data,we can know that the gene ITGB4, ABCB5, KRT19, ACTB, KRT12, KRT5, GAPDH, KRT3, PAX6, WNT7A, KRT14, TP63, KRT10 are expressed differently across the sample group.It is fit to what we are interested.And,across the sample(Barplot),there are many genes that actually expressed different.The barplot of sample group tells us that the gene expression level in hc001 is almost 0.So we consider remove the sample group hc001. Next,based on the explore plot,we will analysis data deeply with other method

According to barplot across sample group,remove the hc001 group.

## [1] "Scaling data matrix"
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

Dimensionality reduction

PCA and tSNE

Here,do the dimensionality reduction using the PCA, tSNE method 
all.pbmc <- PCA.TSNE(object = human.all.DESeq, pcs.compute = FALSE, num.pcs = 28)

After the PCA and tSNE,try plot: Featureplot of ITGB4,four var.genes,PCA plot,tSNE plot…

# FeaturePlot(object = all.pbmc,features.plot ='ITGB4',pt.size = 4,no.legend
# = FALSE) # ITGB4 gene in part dataset
FeaturePlot(object = all.pbmc, features.plot = important.genes[important.genes %in% 
    rownames(human.all.DESeq@raw.data)], pt.size = 1, no.legend = FALSE, reduction.use = "pca")  # ITGB4 gene in part dataset

DimPlot(all.pbmc, reduction.use = "tsne", pt.size = 4)  #  grour by sample

DimPlot(all.pbmc, reduction.use = "pca", pt.size = 4)  #  grour by sample

DimHeatmap(all.pbmc, reduction.type = "pca", check.plot = FALSE)

FeatureHeatmap(all.pbmc, features.plot = "ITGB4", pt.size = 3, plot.horiz = TRUE, 
    cols.use = c("lightgrey", "blue"))

The Faetureplot of ITGB4, ABCB5, KRT19, ACTB, KRT12, KRT5, GAPDH, KRT3, PAX6, WNT7A, KRT14, TP63, KRT10based on PCA shows that,they only has high expression level in few samples,and expresss lowly in most sample.It means that may be these important genes express differently across sample.The plot also tell us the gene KRT5,GAPDH,PAXX6,KRT14 have more higher expression level than the other important genes.It is consistent with the result of violin plot. About the heatmap,we only show the gene ITGB4 And the FeatureHeatmap and Heamap also comfirm this phenomeno.We try the other four variable genes,which has the similar result as gene ITGB4 But the tSNE and * PCA * plot show that, the sample can not be split apparently.The result may be is not good based on the PCA and tSNE method.

Differential expression

Next,we will have analysis on gene differential expression.Find maker genes across sample.We use the method: **wilcox test**
# Finds markers (differentially expressed genes) for each of the identity
# classes in a dataset
all_markers <- FindAllMarkers(all.pbmc, test.use = "bimod", print.bar = FALSE)
head(all_markers)
##                    p_val  avg_logFC pct.1 pct.2     p_val_adj cluster
## MTRNR2L1   1.360725e-131 -2.2312801 0.783 0.509 2.792752e-127   hc006
## AC009501.4 2.734415e-117 -1.7443481 0.717 0.266 5.612113e-113   hc006
## MTATP6P1    5.006631e-73 -2.1743262 1.000 0.907  1.027561e-68   hc006
## BEST1       1.157891e-69  0.8045069 0.859 0.049  2.376456e-65   hc006
## PEX26       2.194607e-65 -0.5001337 0.522 0.084  4.504211e-61   hc006
## ZNF664      1.789094e-64  0.4046469 0.935 0.123  3.671936e-60   hc006
##                  gene
## MTRNR2L1     MTRNR2L1
## AC009501.4 AC009501.4
## MTATP6P1     MTATP6P1
## BEST1           BEST1
## PEX26           PEX26
## ZNF664         ZNF664

We check whether the important genes are still in the marker genes we found from the DESeq analysis. the genes:ITGB4, KRT19, ACTB, KRT5, GAPDH, KRT3, PAX6, KRT14, TP63, KRT10 are still in the marker genes.

Bar plot of gene’s p.val for the most significant expressed genes

Heatmap for important genes group by sample batch

human.heatmap <- Heatmap_fun(genes = important.genes[important.genes %in% rownames(human.all.DESeq@raw.data)], 
    tpm.data = all.pbmc@scale.data, condition = unique(as.character(all.pbmc@ident)), 
    all.condition = as.character(all.pbmc@ident))
## There ara 8 conditions
## Whether creat data accurate 0
NMF::aheatmap(human.heatmap[[2]], Rowv = NA, Colv = NA, annCol = human.heatmap[[1]], 
    scale = "none")

We have find all marker genes across sample,there are 5065 significant genes(adjust p-value <0.05) in all marker genes.

Next,Spectral k-means clustering on single cells based on PCA

all.pbmc <- KClustDimension(all.pbmc, reduction.use = "pca", k.use = 9)
clusters.pca <- all.pbmc@meta.data$kdimension.ident
DimPlot(all.pbmc, pt.size = 4, group.by = "kdimension.ident")

Spectral k-means clustering on single cells based on tSNE

all.pbmc <- KClustDimension(all.pbmc, reduction.use = "tsne", k.use = 9)
clusters.tsne <- all.pbmc@meta.data$kdimension.ident
DimPlot(all.pbmc, pt.size = 4, group.by = "kdimension.ident", reduction.use = "tsne")

Step 3: another approach to handle gene count data and use the DESeq testto detect genes Differential expression.

When use the DESeq,it must require the gene count matrix satisify that: every gene contains at least one zero, cannot compute log geometric means. So have to take another method to handle data,but I do not know whether it is reasonable.Just try!!!

DESeq test

it will take a long time

# sample group
condition.1 <- unlist(lapply(all.pbmc@cell.names, function(x) return(str_split(x, 
    "_")[[1]][1])))
# condition.2<-unlist(lapply(colnames(human.only.pro),function(x)return(str_split(x,'_')[[1]][2])))
load("Human.sample.RData")
plotDispEsts(xdds, main = "Per-gene Dispersion")

Show the DESeq result between group hc006 and hc009: the differently expressed genes

res <- results(xdds, contrast = c("condition.1", "hc009", "hc006"))
res[which(res$padj < 0.05), ]
## log2 fold change (MLE): condition.1 hc009 vs hc006 
## Wald test p-value: condition.1 hc009 vs hc006 
## DataFrame with 12296 rows and 6 columns
##         baseMean log2FoldChange     lfcSE       stat       pvalue
##        <numeric>      <numeric> <numeric>  <numeric>    <numeric>
## A2M     2.289971       2.206884 0.2321168   9.507642 1.950334e-21
## A2ML1   1.262714      -0.798983 0.1739448  -4.593313 4.362640e-06
## A4GALT  2.169455      -1.159624 0.1904962  -6.087384 1.147705e-09
## AAAS    2.185565      -1.773275 0.1911290  -9.277899 1.728531e-20
## AAGAB   7.465588      -2.233763 0.2809110  -7.951854 1.837409e-15
## ...          ...            ...       ...        ...          ...
## ZXDC    2.039650     -1.5112369 0.1839614  -8.214967 2.122222e-16
## ZYX    10.361767     -3.4962954 0.2837317 -12.322539 6.850287e-35
## ZZZ3    4.549581      0.9134068 0.2591173   3.525070 4.233704e-04
## snoU13  3.382409     -0.8434911 0.2193144  -3.846036 1.200443e-04
## uc_338  1.142337     -0.4314247 0.1669928  -2.583493 9.780544e-03
##                padj
##           <numeric>
## A2M    1.553356e-20
## A2ML1  1.049662e-05
## A4GALT 3.725465e-09
## AAAS   1.292053e-19
## AAGAB  9.438512e-15
## ...             ...
## ZXDC   1.179148e-15
## ZYX    1.281859e-33
## ZZZ3   8.481449e-04
## snoU13 2.545476e-04
## uc_338 1.660705e-02

the table are the result which gene’s padj value<0.05.Include the variable:baseMean,log2FoldChange lfcSE,stat,pvalue,padj.

Do the DESeq test across all cells with sample group.And get all the significant genes between two groups(p.value < 0.05)

Venn diagram

library(VennDiagram)
grid.draw(venn.diagram(x[1:4], filename = NULL, fill = c("dodgerblue", "goldenrod1", 
    "darkorange1", "darkorchid1")))

Venn diagram show that the common differentially expressing genes across gorups.The number represents the the number of common genes.Here just show the randomly selected groups